For this project, we simulate that Team Outlier is a team of data scientists hired by the Capital Bikeshare company to provide data-driven answers that support its business decisions.
The following business questions will be guided by the two-year historical log covering 2011 and 2012.
The company needs to reduce operational expenses by aligning manpower with extra demand. Team Outlier will therefore predict the total ridership count from the available variables. To support the estimation of operational expenses, this model will try to answer the questions below.
Capital Bikeshare would also like a targeted marketing strategy to help it spend its budget efficiently. Team Outlier needs to determine the recommended season in which to run promotional offers to increase the number of customers.
The bike-sharing rental process is strongly influenced by environmental and seasonal conditions: weather, precipitation, day of week, season, hour of the day, and so on can all affect rental behavior. The core data set is the publicly available two-year historical log (2011 and 2012) from the Capital Bikeshare system in Washington, D.C., USA. Our source (the UCI Machine Learning Repository) aggregated the data on both an hourly and a daily basis and added the corresponding weather and seasonal information, with the weather data extracted from http://www.freemeteo.com.
There are two datasets, hour.csv and day.csv. Both have the same fields except hr, which is not present in day.csv. hour.csv contains bike-sharing counts aggregated on an hourly basis (17379 records), while day.csv contains counts aggregated on a daily basis (731 records).
library(readr)
day_data = read.csv("dataset/day.csv")
#head(day_data)
hour_data = read.csv("dataset/hour.csv")
#head(hour_data)
# Convert Season to Factor
day_data$season = as.factor(day_data$season)
# Set Seasons
levels(day_data$season) = c("Spring", "Summer", "Fall", "Winter")
# Convert Holiday to Factor
day_data$holiday = as.factor(day_data$holiday)
# Set Holiday
levels(day_data$holiday) = c("No", "Yes")
# Convert WorkingDay to Factor
day_data$workingday = as.factor(day_data$workingday)
levels(day_data$workingday) = c("No", "Yes")
# Convert Weather to Factor
day_data$weathersit = as.factor(day_data$weathersit)
levels(day_data$weathersit) = c("Clear", "Mist", "LightPrecip")
# Convert Weekday to Factor
day_data$weekday = as.factor(day_data$weekday)
levels(day_data$weekday) = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
# Convert Month to Factor
day_data$mnth = as.factor(day_data$mnth)
levels(day_data$mnth) = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
# Remove rows with missing values
day_data = na.omit(day_data)
hour_data = na.omit(hour_data)
Before going into the modelling, we will explore the data set to uncover initial patterns, characteristics, and points of interest. This exploratory analysis will look at the distributions of individual predictors, the relationships between the predictors and the response, and the correlation and interaction between predictors as they relate to the response.
A detailed version of this exploratory analysis is included in the Appendix.
We will remove a few variables, such as instant and dteday, that are unique to each row; these variables do not contribute to predicting total ridership.
data = day_data[, c(-1, -2, -14, -15)] # remove instant, dteday, casual and registered (casual + registered = cnt, so keeping them would leak the response)
We define a helper function to calculate the LOOCV-RMSE:
calc_loocv_rmse = function(model) {
temp = (resid(model) / (1 - hatvalues(model))) ^ 2
temp = temp[is.finite(temp)]
sqrt(mean(temp))
}
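The function above relies on the standard leave-one-out identity for linear models, which gives the LOOCV error from a single fit using the ordinary residuals \(e_i\) and leverages \(h_i\), without refitting the model \(n\) times:

\[ \text{RMSE}_{\text{LOOCV}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{e_i}{1-h_i}\right)^2} \]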
numerical = unlist(lapply(data, is.numeric)) # logical vector: TRUE where a variable is numeric
Let's begin by creating a model using only the numerical predictors:
data_numerical = data[, numerical] # get all the numerical columns
bike_mod_num = lm(cnt ~ ., data = data_numerical) # model with all numeric variables
# Get Model Parameters - adjusted r-squared and loocv rmse
summary(bike_mod_num)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.7261
calc_loocv_rmse(bike_mod_num) # get the loocv rmse
## [1] 1042
This model has a fairly high cross-validated RMSE and only a modest adjusted \(R^2\), so we can conclude that it does not explain the response very well.
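Recall how the adjusted \(R^2\) relates to the ordinary \(R^2\): it penalizes additional predictors, so it can decrease when a weak predictor is added. With \(n\) observations and \(p\) predictors,

\[ \bar{R}^2 = 1 - \frac{(1-R^2)(n-1)}{n-p-1} \]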
Let’s now examine the p-values for the coefficients of the model:
summary(bike_mod_num)$coefficients[, 'Pr(>|t|)']
## (Intercept) yr temp atemp hum windspeed
## 7.548e-19 6.263e-109 2.938e-01 4.506e-03 1.193e-15 7.786e-15
The summary above shows that temp is not a very significant predictor, as it has a high p-value; this may be due to collinearity between the temp and atemp variables.
par(mfrow = c(2, 2))
plot(bike_mod_num, col = 'dodgerblue') # diagnostic plots
The Residuals vs Fitted plot does not show constant variance, violating an assumption of the linear regression model. The Residuals vs Leverage plot also highlights some outliers.
bike_mod_all = lm(cnt ~ ., data = data) # modelling with all the variables
# Get Model Parameters - adjusted r-squared and loocv rmse
summary(bike_mod_all)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.8423
calc_loocv_rmse(bike_mod_all) # get the loocv rmse
## [1] 794.8
Including the categorical variables has substantially improved the adjusted \(R^2\) and lowered the LOOCV-RMSE.
P-values for coefficients for the model:
summary(bike_mod_all)$coefficients[, 'Pr(>|t|)'] # get the coefficient p-values
## (Intercept) seasonSummer seasonFall
## 9.775e-10 1.032e-06 1.025e-04
## seasonWinter yr mnthFeb
## 2.280e-17 2.295e-154 3.624e-01
## mnthMar mnthApr mnthMay
## 1.085e-03 6.882e-02 6.145e-03
## mnthJun mnthJul mnthAug
## 6.842e-02 9.219e-01 1.426e-01
## mnthSep mnthOct mnthNov
## 1.652e-04 3.179e-02 6.133e-01
## mnthDec holidayYes weekdayMon
## 6.231e-01 1.130e-03 5.319e-02
## weekdayTue weekdayWed weekdayThu
## 3.982e-03 4.136e-04 3.498e-04
## weekdayFri weekdaySat weathersitMist
## 5.296e-05 4.007e-05 3.156e-09
## weathersitLightPrecip temp atemp
## 5.386e-22 4.153e-02 2.223e-01
## hum windspeed
## 2.014e-07 2.092e-11
The p-values of the coefficients of the additive model indicate that not all of the variables are significant.
Diagnostic plots for the model:
par(mfrow = c(2, 2))
plot(bike_mod_all, col = 'dodgerblue') # diagnostic plots
Even with the categorical variables included, the diagnostic plots still show some issues.
The Residuals vs Fitted plot shows a non-linear trend and non-constant variance, violating the assumptions of the linear regression model; we also see some extreme outliers.
The Normal Q-Q plot shows some fat tails.
The Residuals vs Leverage plot also indicates some outliers, which we may have to examine as the analysis proceeds.
Based on the above results, we will examine the significance of several of the categorical variables in explaining the response:
bike_mod_w_month = lm(cnt ~ . - mnth, data = data) # model without month
bike_mod_w_weekday = lm(cnt ~ . - weekday, data = data) # model without weekday
bike_mod_w_workingday = lm(cnt ~ . - workingday, data = data) # model without workingday
# anova test to compare above three models
anova(bike_mod_w_month,
bike_mod_w_weekday,
bike_mod_w_workingday,
bike_mod_all)
## Analysis of Variance Table
##
## Model 1: cnt ~ (season + yr + mnth + holiday + weekday + workingday +
## weathersit + temp + atemp + hum + windspeed) - mnth
## Model 2: cnt ~ (season + yr + mnth + holiday + weekday + workingday +
## weathersit + temp + atemp + hum + windspeed) - weekday
## Model 3: cnt ~ (season + yr + mnth + holiday + weekday + workingday +
## weathersit + temp + atemp + hum + windspeed) - workingday
## Model 4: cnt ~ season + yr + mnth + holiday + weekday + workingday + weathersit +
## temp + atemp + hum + windspeed
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 713 472452877
## 2 707 428442679 6 44010198 12.40 2.7e-13 ***
## 3 702 415401688 5 13040991 4.41 0.00059 ***
## 4 702 415401688 0 0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA test shows that month and weekday are significant predictors, so we cannot rule them out. Even though the month variable is statistically significant overall, it may be that only a few of its levels are useful; we will use model selection schemes later to investigate this.
According to the test results, the workingday variable appears to be non-significant (dropping it leaves the residual sum of squares unchanged, since it is redundant given weekday and holiday), so we can rule it out.
We will now re-fit the model excluding workingday:
data_2 = data[, c(-6)] # remove working day variable
bike_mod_all_2 = lm(cnt ~ ., data = data_2) # model with all remaining variable
# Get Model Parameters - adjusted r-squared and loocv rmse
summary(bike_mod_all_2)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.8423
calc_loocv_rmse(bike_mod_all_2) # get the loocv rmse
## [1] 794.8
Excluding workingday had no effect on the adjusted \(R^2\) of the model, so we can safely remove this variable.
Now we will use the Variance Inflation Factor (VIF) to determine whether we have multicollinearity issues in the model:
library(faraway)
vif(bike_mod_all_2)
## seasonSummer seasonFall seasonWinter
## 7.496 10.720 7.455
## yr mnthFeb mnthMar
## 1.047 1.836 2.624
## mnthApr mnthMay mnthJun
## 5.704 6.868 7.423
## mnthJul mnthAug mnthSep
## 9.444 8.813 6.542
## mnthOct mnthNov mnthDec
## 5.595 4.957 3.184
## holidayYes weekdayMon weekdayTue
## 1.121 1.822 1.730
## weekdayWed weekdayThu weekdayFri
## 1.741 1.743 1.740
## weekdaySat weathersitMist weathersitLightPrecip
## 1.726 1.642 1.338
## temp atemp hum
## 80.807 70.037 2.140
## windspeed
## 1.273
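For reference, the VIF of predictor \(x_j\) is computed from the \(R_j^2\) obtained by regressing \(x_j\) on all of the other predictors:

\[ \text{VIF}_j = \frac{1}{1-R_j^2} \]

Values above roughly 5 (or, more leniently, 10) are commonly taken to indicate problematic collinearity, which is clearly the case for temp and atemp here.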
Looking at the VIF results above (and as we had already suspected from the p-values), temp and atemp have a high level of collinearity. We will now look at the partial correlation coefficient between temp and cnt:
temp_model_1 = lm(temp ~ . - cnt, data = data_2)
temp_model_2 = lm(cnt ~ . - temp, data = data_2)
cor(resid(temp_model_1), resid(temp_model_2))
## [1] 0.07684
While this is relatively small, as temp and atemp are highly correlated we should check the partial correlation coefficient after removing atemp:
temp_model_1 = lm(temp ~ . - cnt - atemp, data = data_2)
temp_model_2 = lm(cnt ~ . - temp - atemp, data = data_2)
cor(resid(temp_model_1), resid(temp_model_2))
## [1] 0.3801
Let us observe the partial correlation coefficient between atemp and cnt, removing temp:
temp_model_1 = lm(atemp ~ . - cnt - temp, data = data_2)
temp_model_2 = lm(cnt ~ . - atemp - temp, data = data_2)
cor(resid(temp_model_1), resid(temp_model_2))
## [1] 0.3758
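In each computation above, the partial correlation is simply the correlation between two residual vectors: the part of one variable not explained by the remaining predictors \(Z\), and the part of cnt not explained by the same \(Z\):

\[ r_{\text{cnt},\,\text{temp}\mid Z} = \operatorname{cor}\!\big(\operatorname{resid}(\text{temp}\sim Z),\ \operatorname{resid}(\text{cnt}\sim Z)\big) \]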
These results indicate that temp is more correlated with cnt than atemp is, so we will remove atemp and leave temp:
data_3 = data_2[, -8]
bike_mod_all_3 = lm(cnt ~ ., data = data_3)
# Get Model Parameters - adjusted r-squared and loocv rmse
summary(bike_mod_all_3)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.8422
calc_loocv_rmse(bike_mod_all_3) # get the loocv rmse
## [1] 789.3
This change slightly lowered the adjusted \(R^2\), as we would expect when removing a predictor, but it improved the LOOCV-RMSE, which indicates a better model.
As we have concerns about the year predictor, we will also look at the partial correlation coefficient between year and count:
yr_mod_0 = lm(cnt ~ . - yr, data = data_3)
yr_mod_1 = lm(yr ~ . - cnt, data = data_3)
cor(resid(yr_mod_0), resid(yr_mod_1))
## [1] 0.7943
The partial correlation coefficient is quite high, which indicates that year has a significant relationship with ridership. We have seen that ridership appears to increase from year to year (with seasonal cycles within that trend). Since our data covers only two years, this may cause problems when using the model to extrapolate to years outside the training set. However, the high partial correlation coefficient indicates that year is an important predictor, so we will keep it in the model.
Next, we would like to check for potential outliers. Of the available strategies, we will use Cook's distance to identify influential observations and assess their effect on the model.
First, we will count the number of observations flagged by Cook's distance (using the common \(4/n\) cutoff):
sum(cooks.distance(bike_mod_all_3) > 4 / length(cooks.distance(bike_mod_all_3)))
## [1] 44
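Cook's distance measures how much the fitted values change when observation \(i\) is removed; for a model with \(p\) coefficients and residual variance estimate \(s^2\),

\[ D_i = \frac{e_i^2}{p\,s^2}\cdot\frac{h_i}{(1-h_i)^2} \]

The cutoff \(4/n\) used above is a common rule of thumb for flagging influential observations, not a strict test.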
The next step is to refit the model excluding the identified observations:
cokks_distance = cooks.distance(bike_mod_all_3)
bike_mod_all_4 = lm(cnt ~ .,
data = data_3,
subset = cokks_distance <= 4 / length(cokks_distance))
# Get Model Parameters - adjusted r-squared and loocv rmse
summary(bike_mod_all_4)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.9086
calc_loocv_rmse(bike_mod_all_4) # get the loocv rmse
## [1] 586.6
Removing these outliers resulted in a substantial increase in adjusted \(R^2\) and a substantial decrease in LOOCV-RMSE.
par(mfrow = c(2, 2))
plot(bike_mod_all_4, col = 'dodgerblue') # diagnostic plots
Looking at the Residuals vs Fitted plot, the distribution of the variance now looks fairly constant, although we still see some non-linear patterns, which indicates that we may want to try higher-order terms or transformations.
The Residuals vs Leverage plot also looks much cleaner, and the Normal Q-Q plot has improved.
We will now evaluate including interactions:
bike_mod_all_5 = lm(cnt ~ . ^ 2,
data = data_3,
subset = cokks_distance <= 4 / length(cokks_distance))
# Get Model Parameters - adjusted r-squared and loocv rmse
summary(bike_mod_all_5)[["adj.r.squared"]]
## [1] 0.9467
Including all possible interactions resulted in a substantial improvement in adjusted \(R^2\). However, we cannot be sure whether this improvement reflects a genuinely better model or merely the inclusion of many additional predictors.
calc_loocv_rmse(bike_mod_all_5) # get the loocv rmse
## [1] 759.6
The LOOCV-RMSE has increased, which indicates that the additional terms may have overfit the data.
par(mfrow = c(2, 2))
plot(bike_mod_all_5, col = 'dodgerblue') # diagnostic plots
length(coef(bike_mod_all_5)) # get the number of params
## [1] 305
The Residuals vs Fitted plot looks random, with errors distributed around the zero line, though some outliers are still highlighted.
The Normal Q-Q plot looks closer to normal.
The Residuals vs Leverage plot shows some indication of potential new outliers.
Since the model including all possible interactions increased the LOOCV-RMSE, indicating overfitting to the training data, we will perform a backward AIC step search to remove the non-significant terms.
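At each step, the backward search drops the term whose removal most reduces the AIC; for a Gaussian linear model this criterion is, up to an additive constant,

\[ \text{AIC} = n\log\!\left(\frac{\text{RSS}}{n}\right) + 2p \]

where RSS is the residual sum of squares and \(p\) the number of estimated parameters, so the \(2p\) term penalizes model size.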
bike_mod_all_6 = step(bike_mod_all_5, trace = 0, direction = "backward")
length(coef(bike_mod_all_6)) # get the no of params
## [1] 117
summary(bike_mod_all_6)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.9482
While decreasing the number of coefficients from 305 to 117, the backward step search also produced a very small improvement in adjusted \(R^2\).
calc_loocv_rmse(bike_mod_all_6) # get the loocv rmse
## [1] 470.9
The LOOCV-RMSE also improved significantly, which indicates that this smaller model generalizes much better.
par(mfrow = c(2, 2))
plot(bike_mod_all_6,col = 'dodgerblue') # do diagnostics
The Residuals vs Fitted plot still looks reasonable.
We have previously seen some issues which indicated that polynomial features might improve the model. We will now evaluate including them:
temp_mod = lm(
cnt ~ . + I(temp ^ 2) + I(hum ^ 2) + I(windspeed ^ 2),
data = data_3,
subset = cokks_distance <= 4 / length(cokks_distance)
)
bike_mod_all_7 = step(temp_mod, trace = 0, direction = "backward")
summary(bike_mod_all_7)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.9203
The adjusted \(R^2\) is lower than it was for our interaction model.
calc_loocv_rmse(bike_mod_all_7) # get the loocv rmse
## [1] 548.5
The LOOCV-RMSE has also increased.
par(mfrow = c(2, 2))
plot(bike_mod_all_7,col = 'dodgerblue') # do diagnostics
The Residuals vs Fitted plot does not look random and shows some non-linear pattern, which indicates that the inclusion of these terms has not improved the model. However, \(temp^2\) has a very low p-value, which suggests it may still be useful; we will keep this in mind.
Let us evaluate taking a log transformation of the response.
temp_m = lm(log(cnt) ~ . ^ 2,
data = data_3,
subset = cokks_distance <= 4 / length(cokks_distance))
bike_mod_all_8 = step(temp_m, trace = 0, direction = "backward")
summary(bike_mod_all_8)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.9379
The adjusted \(R^2\) is lower after this transformation (noting that \(R^2\) values are not strictly comparable across different response scales).
par(mfrow = c(2, 2))
plot(bike_mod_all_8,col = 'dodgerblue') # do diagnostics
In addition, we observe issues in both the Residuals vs Fitted plot and the Normal Q-Q plot. We conclude that this transformation was not helpful.
Finally, since some of the polynomial terms seemed as if they could be significant, we will try a model which includes interactions and polynomial terms:
bike_mod_int_poly = lm(
cnt ~ (. ^ 2) + I(temp ^ 2) + I(hum ^ 2) + I(windspeed ^ 2),
data = data_3,
subset = cokks_distance <= 4 / length(cokks_distance)
)
bike_mod_all_10 = step(bike_mod_int_poly, trace = 0, direction = "backward")
summary(bike_mod_all_10)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.9498
The adjusted \(R^2\) is the best we have found yet.
calc_loocv_rmse(bike_mod_all_10) # get the loocv rmse
## [1] 461.4
And this model also results in the lowest LOOCV-RMSE.
Finally, we will refit the model using the full data set, find the influential observations, and refit the model excluding them:
bike_mod_int_poly_full = lm(cnt ~ (. ^ 2) + I(temp ^ 2) + I(hum ^ 2) + I(windspeed ^ 2),
data = data_3)
bike_mod_all_11 = step(bike_mod_int_poly_full,
trace = 0,
direction = "backward")
cooks_distance = cooks.distance(bike_mod_all_11) # find influential observations and exclude them
filter = cooks_distance <= (4 / length(cooks_distance))
# some points have a cooks distance of NA, we will consider these to be outliers
filter[is.na(filter)] <- FALSE
bike_mod_int_poly_full = lm(cnt ~ (. ^ 2) + I(temp ^ 2) + I(hum ^ 2) + I(windspeed ^ 2),
data = data_3,
subset = filter)
bike_mod_all_11 = step(bike_mod_int_poly_full,
trace = 0,
direction = "backward")
summary(bike_mod_all_11)[["adj.r.squared"]] # get the adjusted r-squared
## [1] 0.9582
calc_loocv_rmse(bike_mod_all_11) # get the loocv rmse
## [1] 426.8
Finding and removing the high-leverage observations has improved both the LOOCV-RMSE and the adjusted \(R^2\).
library(readr)
day_data = read.csv("dataset/day.csv")
# knitr::kable(head(day_data)[, 1:15])
#head(day_data)
hour_data = read.csv("dataset/hour.csv")
# knitr::kable(head(hour_data)[, 1:15])
#head(hour_data)
# spring filtered data
spring_data = subset(day_data, day_data$season == 1)
# summer filtered data
summer_data = subset(day_data, day_data$season == 2)
# fall filtered data
fall_data = subset(day_data, day_data$season == 3)
# winter filtered data
winter_data = subset(day_data, day_data$season == 4)
# remove NA
spring_data = na.omit(spring_data)
summer_data = na.omit(summer_data)
fall_data = na.omit(fall_data)
winter_data = na.omit(winter_data)
Let us take a quick look at the filtered data.
# plot to see the difference
par(mfrow=c(2,2))
hist(spring_data$cnt, col = "dodgerblue",border = "darkorange", xlab = "count", main = "Spring Daily Rental Bike Summary")
hist(summer_data$cnt, col = "dodgerblue",border = "darkorange", xlab = "count", main = "Summer Daily Rental Bike Summary")
hist(fall_data$cnt, col = "dodgerblue", border = "darkorange", xlab = "count", main = "Fall Daily Rental Bike Summary")
hist(winter_data$cnt, col = "dodgerblue",border = "darkorange", xlab = "count", main = "Winter Daily Rental Bike Summary")
We will remove the instant and dteday variables, since they are merely row-reference indices, and then use the cor() and pairs() functions to see which predictors are highly correlated.
day_data_converted = day_data[3:16]
# convert all factor variables to numeric in order to call cor()
for(name in colnames(day_data_converted)) {
if (is.integer(day_data_converted[[name]]))
day_data_converted[[name]] = as.numeric(day_data_converted[[name]])
}
Let us visually check the correlation between the predictors. Two interesting pairs of variables stand out immediately: (1) temp and atemp, and (2) registered and cnt.
pairs(day_data_converted, col = "dodgerblue")
Now, let us check using the cor() function. We will use correlation values > 0.5 to identify the predictors most strongly correlated with cnt:
all_cor = round(cor(day_data_converted), 2)
(correlated_predictors = sort(abs(all_cor["cnt", abs(all_cor["cnt", ]) > 0.5]), decreasing = TRUE)[-1])
## registered casual temp atemp yr
## 0.95 0.67 0.63 0.63 0.57
Based on the results of the cor() computation, the predictors above should help predict the recommended season for running promotional offers to reach the company's goal of increasing the number of customers.
We will now search for the best model. First, we build a model from the highly correlated predictors, and also build a full model for comparison. Comparing the adjusted \(R^2\) values, both are essentially perfect; this is unsurprising, since cnt = casual + registered by definition, so any model containing both of those predictors reproduces the response exactly.
library(knitr)
library(kableExtra)
cor_model = lm(cnt ~ registered + casual + temp + atemp + yr, data = day_data_converted)
summary(cor_model)$adj.r.squared
## [1] 1
full_day_model = lm(cnt ~ ., data = day_data_converted)
summary(full_day_model)$adj.r.squared
## [1] 1
comp_results = data.frame(
"Correlated Model" = c(
"Adjusted R-squared" = summary(cor_model)$adj.r.squared,
"R-Squared" = summary(cor_model)$r.squared
),
"Full Model" = c(
"Adjusted R-squared" = summary(full_day_model)$adj.r.squared,
"R-Squared" = summary(full_day_model)$r.squared
)
)
kable(comp_results) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| Correlated.Model | Full.Model | |
|---|---|---|
| Adjusted R-squared | 1 | 1 |
| R-Squared | 1 | 1 |
Both models have a good adjusted \(R^2\). However, we want to check for predictors with a high variance inflation factor (VIF) to quantify the effect of collinearity on the variance of our regression estimates. It is also interesting to compare the VIFs of the two models.
sort(vif(cor_model)[vif(cor_model) > 5], decreasing = TRUE)
## atemp temp
## 61.54 60.59
sort(vif(full_day_model)[vif(full_day_model) > 5], decreasing = TRUE)
## atemp temp registered
## 64.806 63.650 5.993
Based on the above comparison, atemp and temp are a consistent concern in both models because of their strong collinearity.
Let us run an F-test to see which model is preferred. Based on the test, we prefer the smaller model.
anova(cor_model, full_day_model)
## Analysis of Variance Table
##
## Model 1: cnt ~ registered + casual + temp + atemp + yr
## Model 2: cnt ~ season + yr + mnth + holiday + weekday + workingday + weathersit +
## temp + atemp + hum + windspeed + casual + registered
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 725 7.61e-22
## 2 717 1.75e-21 8 -9.86e-22
Let's take cor_model and remove the two predictors with the highest VIFs. Then, for the remaining variables, instead of checking them individually, we will perform an exhaustive search.
cor_model_no_high_vif = lm(cnt ~ registered + casual + yr, data = day_data_converted)
library(leaps)
all_cnt_mod = summary(regsubsets(cnt ~ ., data = day_data_converted, really.big = FALSE))
best_r2_ind = which.max(all_cnt_mod$adjr2)
# extract the predictor of the model
all_cnt_mod$which[best_r2_ind, ]
## (Intercept) season yr mnth holiday weekday
## TRUE FALSE FALSE FALSE FALSE FALSE
## workingday weathersit temp atemp hum windspeed
## FALSE FALSE FALSE FALSE FALSE FALSE
## casual registered
## TRUE TRUE
We do not think we need to optimize for LOOCV-RMSE, since the model with the casual and registered predictors already has the highest adjusted \(R^2\).
To compute the mean prediction for each season, we will use the predictors identified in the previous step.
best_model = lm(cnt ~ registered + casual, data = day_data)
spring_pred = mean(predict(best_model, newdata = spring_data, interval = c("prediction"), level = 0.99))
summer_pred = mean(predict(best_model, newdata = summer_data, interval = c("prediction"), level = 0.99))
fall_pred = mean(predict(best_model, newdata = fall_data, interval = c("prediction"), level = 0.99))
winter_pred = mean(predict(best_model, newdata = winter_data, interval = c("prediction"), level = 0.99))
# create table for prediction results
pred_results = data.frame(
"prediction" = c(
"Spring" = spring_pred,
"Summer" = summer_pred,
"Fall" = fall_pred,
"Winter" = winter_pred
)
)
kable(pred_results) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| prediction | |
|---|---|
| Spring | 2604 |
| Summer | 4992 |
| Fall | 5644 |
| Winter | 4728 |
Our best model included both interactions and polynomial terms.
summary(bike_mod_all_11)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1683.158 628.17 -2.67947 7.604e-03
## seasonSummer 1007.283 514.23 1.95882 5.066e-02
## seasonFall 5346.488 1355.40 3.94459 9.075e-05
## seasonWinter 1875.157 977.30 1.91871 5.556e-02
## yr 1571.808 290.91 5.40307 9.936e-08
## mnthFeb -36.596 426.17 -0.08587 9.316e-01
## mnthMar -874.914 487.62 -1.79424 7.335e-02
## mnthApr -1649.058 850.65 -1.93860 5.308e-02
## mnthMay 1594.910 1098.36 1.45208 1.471e-01
## mnthJun 5073.201 1479.97 3.42790 6.558e-04
## mnthJul 7675.643 2280.35 3.36600 8.183e-04
## mnthAug 576.367 2076.58 0.27756 7.815e-01
## mnthSep -233.984 1569.62 -0.14907 8.816e-01
## mnthOct -248.871 1256.55 -0.19806 8.431e-01
## mnthNov -406.180 1189.34 -0.34152 7.328e-01
## mnthDec -947.636 981.89 -0.96511 3.349e-01
## holidayYes -4186.149 1605.72 -2.60702 9.392e-03
## weekdayMon 205.959 138.54 1.48668 1.377e-01
## weekdayTue 349.335 136.33 2.56237 1.067e-02
## weekdayWed 404.042 140.25 2.88086 4.127e-03
## weekdayThu 236.222 139.95 1.68793 9.202e-02
## weekdayFri 430.350 141.75 3.03604 2.516e-03
## weekdaySat -16.410 130.45 -0.12579 8.999e-01
## weathersitMist 551.344 350.46 1.57321 1.163e-01
## weathersitLightPrecip -14720.917 155712.10 -0.09454 9.247e-01
## temp 7183.428 1596.36 4.49987 8.373e-06
## hum 3899.354 1665.27 2.34157 1.957e-02
## windspeed 8690.057 1838.65 4.72632 2.937e-06
## I(temp^2) -5972.664 2799.15 -2.13374 3.333e-02
## I(hum^2) -2106.785 1306.74 -1.61224 1.075e-01
## I(windspeed^2) -9860.844 2280.57 -4.32385 1.834e-05
## seasonSummer:yr 1285.263 264.18 4.86517 1.513e-06
## seasonFall:yr 533.058 326.78 1.63125 1.034e-01
## seasonWinter:yr 659.955 313.90 2.10244 3.599e-02
## seasonSummer:holidayYes -872.834 771.16 -1.13185 2.582e-01
## seasonFall:holidayYes -1920.061 1083.22 -1.77255 7.688e-02
## seasonSummer:weekdayMon -719.253 174.75 -4.11580 4.477e-05
## seasonFall:weekdayMon -355.732 172.03 -2.06790 3.914e-02
## seasonWinter:weekdayMon -242.071 174.39 -1.38808 1.657e-01
## seasonSummer:weekdayTue -647.481 170.67 -3.79378 1.656e-04
## seasonFall:weekdayTue -160.706 165.72 -0.96974 3.326e-01
## seasonWinter:weekdayTue 9.149 172.05 0.05318 9.576e-01
## seasonSummer:weekdayWed -710.701 165.60 -4.29155 2.112e-05
## seasonFall:weekdayWed -302.930 169.48 -1.78740 7.445e-02
## seasonWinter:weekdayWed -209.656 173.69 -1.20705 2.280e-01
## seasonSummer:weekdayThu -476.993 167.15 -2.85376 4.490e-03
## seasonFall:weekdayThu -252.243 166.95 -1.51093 1.314e-01
## seasonWinter:weekdayThu -107.387 177.17 -0.60613 5.447e-01
## seasonSummer:weekdayFri -408.567 170.47 -2.39665 1.689e-02
## seasonFall:weekdayFri -308.933 165.83 -1.86301 6.302e-02
## seasonWinter:weekdayFri -229.444 170.85 -1.34292 1.799e-01
## seasonSummer:weekdaySat 231.030 165.28 1.39781 1.628e-01
## seasonFall:weekdaySat 263.169 163.60 1.60862 1.083e-01
## seasonWinter:weekdaySat 371.076 164.71 2.25296 2.467e-02
## seasonSummer:weathersitMist -722.271 329.65 -2.19100 2.889e-02
## seasonFall:weathersitMist -563.892 386.75 -1.45804 1.454e-01
## seasonWinter:weathersitMist -627.249 330.50 -1.89787 5.826e-02
## seasonFall:weathersitLightPrecip -1223.564 28423.50 -0.04305 9.657e-01
## seasonSummer:temp -4982.990 1674.57 -2.97568 3.058e-03
## seasonFall:temp -10478.551 2520.38 -4.15753 3.755e-05
## seasonWinter:temp -4927.243 1542.84 -3.19361 1.489e-03
## seasonSummer:hum 2510.785 1053.22 2.38392 1.748e-02
## seasonFall:hum 1984.759 1633.72 1.21487 2.250e-01
## seasonWinter:hum 2244.421 1554.85 1.44349 1.495e-01
## yr:mnthFeb -23.033 176.96 -0.13016 8.965e-01
## yr:mnthMar 161.434 214.44 0.75280 4.519e-01
## yr:mnthApr -933.973 348.40 -2.68078 7.575e-03
## yr:mnthMay -1730.267 378.15 -4.57566 5.926e-06
## yr:mnthJun -1460.192 392.56 -3.71966 2.209e-04
## yr:mnthJul -906.095 447.75 -2.02364 4.351e-02
## yr:mnthAug -394.949 430.68 -0.91705 3.595e-01
## yr:mnthSep -336.953 403.59 -0.83490 4.042e-01
## yr:mnthOct -25.876 382.07 -0.06772 9.460e-01
## yr:mnthNov -519.131 374.19 -1.38734 1.659e-01
## yr:mnthDec -600.743 317.58 -1.89162 5.909e-02
## yr:weekdayMon 447.472 119.09 3.75738 1.909e-04
## yr:weekdayTue 357.617 119.33 2.99684 2.857e-03
## yr:weekdayWed 427.096 121.77 3.50737 4.912e-04
## yr:weekdayThu 655.717 121.02 5.41841 9.162e-08
## yr:weekdayFri 525.160 118.57 4.42901 1.152e-05
## yr:weekdaySat 583.690 117.20 4.98020 8.622e-07
## yr:weathersitMist -289.958 94.12 -3.08079 2.172e-03
## yr:temp 1712.949 558.46 3.06726 2.271e-03
## yr:hum -968.212 371.61 -2.60549 9.434e-03
## yr:windspeed -942.668 487.55 -1.93348 5.371e-02
## mnthFeb:weathersitMist -113.545 210.18 -0.54023 5.893e-01
## mnthMar:weathersitMist -165.057 263.33 -0.62682 5.311e-01
## mnthApr:weathersitMist -344.206 419.60 -0.82032 4.124e-01
## mnthMay:weathersitMist 405.749 428.55 0.94679 3.442e-01
## mnthJun:weathersitMist -286.810 446.78 -0.64196 5.212e-01
## mnthJul:weathersitMist -330.613 503.78 -0.65626 5.119e-01
## mnthAug:weathersitMist -75.573 490.25 -0.15415 8.775e-01
## mnthSep:weathersitMist -262.345 444.44 -0.59028 5.553e-01
## mnthOct:weathersitMist -121.264 406.19 -0.29854 7.654e-01
## mnthNov:weathersitMist 403.136 387.55 1.04021 2.987e-01
## mnthDec:weathersitMist 544.194 340.08 1.60020 1.102e-01
## mnthOct:weathersitLightPrecip -2160.039 29281.34 -0.07377 9.412e-01
## mnthFeb:temp -537.529 1124.60 -0.47797 6.329e-01
## mnthMar:temp 3187.552 1393.64 2.28721 2.258e-02
## mnthApr:temp 8776.006 2373.97 3.69677 2.412e-04
## mnthMay:temp 7050.073 2851.35 2.47254 1.373e-02
## mnthJun:temp 1241.808 3251.55 0.38191 7.027e-01
## mnthJul:temp -2171.925 4051.26 -0.53611 5.921e-01
## mnthAug:temp 7227.527 3819.79 1.89213 5.902e-02
## mnthSep:temp 9624.696 3315.20 2.90320 3.848e-03
## mnthOct:temp 7343.715 2337.15 3.14217 1.771e-03
## mnthNov:temp 4487.196 2367.01 1.89573 5.854e-02
## mnthDec:temp 5835.861 1806.21 3.23100 1.311e-03
## mnthFeb:hum 646.600 723.38 0.89385 3.718e-01
## mnthMar:hum 382.661 828.31 0.46198 6.443e-01
## mnthApr:hum -1073.359 1289.66 -0.83228 4.056e-01
## mnthMay:hum -4150.273 1335.76 -3.10706 1.991e-03
## mnthJun:hum -3742.575 1339.23 -2.79457 5.386e-03
## mnthJul:hum -3628.858 1832.90 -1.97984 4.824e-02
## mnthAug:hum -3939.536 1854.50 -2.12431 3.411e-02
## mnthSep:hum -4590.272 1874.44 -2.44887 1.466e-02
## mnthOct:hum -2888.401 1841.23 -1.56874 1.173e-01
## mnthNov:hum -1784.942 1767.64 -1.00979 3.131e-01
## mnthDec:hum -1577.334 1554.63 -1.01460 3.108e-01
## holidayYes:hum 6725.199 3288.04 2.04535 4.132e-02
## weekdayMon:weathersitMist 200.223 135.30 1.47984 1.395e-01
## weekdayTue:weathersitMist -69.502 135.05 -0.51464 6.070e-01
## weekdayWed:weathersitMist -47.747 135.91 -0.35131 7.255e-01
## weekdayThu:weathersitMist 29.179 134.84 0.21640 8.288e-01
## weekdayFri:weathersitMist -159.464 136.16 -1.17113 2.421e-01
## weekdaySat:weathersitMist -162.242 137.11 -1.18329 2.372e-01
## weekdayMon:weathersitLightPrecip 1433.004 3176.30 0.45116 6.521e-01
## weathersitMist:temp 1859.348 556.17 3.34314 8.872e-04
## weathersitLightPrecip:temp 27925.347 340979.07 0.08190 9.348e-01
## weathersitMist:hum -1743.948 543.31 -3.20988 1.409e-03
## hum:windspeed -9939.237 1881.87 -5.28158 1.877e-07
It uses 151 predictors, which is quite a lot, and the p-values for many of them indicate that they may not be significant. However, the non-significant terms are mostly individual levels of factor variables, which are difficult to remove in isolation.
plot(
bike_mod_all_11$fitted.values,
data_3$cnt[filter],
main = "Fitted Values vs Actual",
xlab = "Fitted Values",
ylab = "Ground Truth",
col = "darkblue",
pch = 20
)
abline(0, 1, col = "firebrick4", lwd = 3)
We also observe that the fitted values for this model are quite close to the actual values.
summary(bike_mod_all_11)$adj.r.squared
## [1] 0.9582
The adjusted \(R^2\) indicates that the model explains about 95.82% of the variance in the data.
The LOOCV-RMSE of the model is 426.8036.
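For reference, the LOOCV-RMSE quoted above can be computed without refitting the model \(n\) times, using the standard leverage shortcut for linear models (the helper name `calc_loocv_rmse` is our own):

```r
# LOOCV-RMSE for a linear model via the leverage shortcut:
# each leave-one-out residual equals e_i / (1 - h_ii), so no refitting is needed
calc_loocv_rmse = function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}
calc_loocv_rmse(bike_mod_all_11)  # reported above as 426.8036
```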
par(mfrow = c(2, 2))
plot(bike_mod_all_11, col = "dodgerblue") # diagnostic plots
The diagnostic plots all appear to be acceptable, although there are still some points with very high leverage.
sqrt(mean(bike_mod_all_11$residuals^2))
## [1] 343.8
The model has an RMSE of roughly 344, meaning the model's predictions are off by about 344 rides on average.
# Count the number of high VIF's in the model
sum(vif(bike_mod_all_11) > 10)
## Warning in v1 * v2: longer object length is not a multiple of shorter object
## length
## [1] 96
This was expected due to the high number of categorical variables and their interactions. Since we are mainly focused on demand prediction, we tuned the model for predictive performance at the cost of interpretability.
hist(resid(bike_mod_all_11),
col = "lightslateblue",
main = "Histogram of Residuals", xlab = "Residuals")
The histogram of residuals looks nearly normal, which supports the assumption that the model's errors are normally distributed and centered about 0.
So far, we have tried multiple approaches to build our understanding of what yields a model with good performance without sacrificing explained variance. At this point, we have reached a decent spot: a model with a high adjusted \(R^2\) and a low LOOCV-RMSE.
Based on the F-test, we preferred the smaller model obtained from the collinearity check, which has the predictors registered, casual, temp, atemp and yr. Starting from that model, we searched for the best model by removing the predictor with the highest VIF and then running an exhaustive search over every remaining subset. We ended up using the registered and casual variables to compute the predictions for each season.
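The search step described above can be sketched as a brute-force loop over all subsets of the collinearity-reduced candidate set (a sketch only, ranking subsets by adjusted \(R^2\); the candidate names follow the day_data frame loaded earlier):

```r
# Exhaustive search: fit every subset of the candidate predictors and
# keep the one with the highest adjusted R^2
candidates = c("registered", "casual", "temp", "atemp", "yr")
best = list(adjr2 = -Inf, vars = NULL)
for (k in seq_along(candidates)) {
  for (vars in combn(candidates, k, simplify = FALSE)) {
    fit = lm(reformulate(vars, response = "cnt"), data = day_data)
    a = summary(fit)$adj.r.squared
    if (a > best$adjr2) best = list(adjr2 = a, vars = vars)
  }
}
best$vars
```

Since cnt is the sum of casual and registered, any subset containing both fits the response almost perfectly, which is consistent with the search settling on those two predictors.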
While the final model contains a large number of predictors, making it difficult to interpret, the results indicate that it is very good at explaining the relationship between weather, season, and bike sharing rentals. We also tried selection using BIC to reduce the size of the final model; however, doing so worsened the LOOCV-RMSE, so we preferred the AIC-selected model.
As expected, the weather situation is an especially important predictor, both by itself and through its interactions with other predictors, indicating that rain has a significant impact on the number of rentals; the interaction between light precipitation and windspeed stands out in particular.
The high adjusted \(R^2\) of the model indicates that a very large portion of the variance in the data can be explained by this model, which would make it very useful for predicting demand for bikes.
Future Improvements:
Looking at the distribution of demand by type of day, as well as the outliers detected via Cook's distance, it might be worthwhile to build separate models for holidays, working days, and weekends.
We could extend this model with a seasonality analysis to see how season and day of week affect the response, and analyze the overall year-on-year demand trend after removing the seasonal component in the data.
Our preliminary assumption was that Winter would be the season needing an enhanced marketing strategy, but upon finding the best model it turned out that Spring is the season that requires a better marketing strategy to attract additional customers. Interrogating the best model, best_model = lm(cnt ~ registered + casual, data = day_data), we note that cnt is the sum of registered and casual riders, which makes these predictors highly correlated with the response.
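That correlation is in fact an identity in this dataset, which is worth checking directly (a minimal sketch; `leak_check` is our own name for the refit):

```r
# cnt is, by construction, casual + registered, so these predictors
# "leak" the response and fit it essentially perfectly
all(day_data$cnt == day_data$casual + day_data$registered)

leak_check = lm(cnt ~ registered + casual, data = day_data)
summary(leak_check)$r.squared
```

If the identity holds, any candidate model containing both predictors will dominate selection criteria, which explains why the exhaustive search settled on them.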
If given additional time, we would like to explore why the Spring season received the lowest prediction, contrary to our initial assumption of Winter. Are there other methods that would lead us to a different best model? Are people biking a lot during Winter because they enjoy the excitement of snow and cold temperatures? It would also be interesting to gather data on how often a customer rents a bike, a comparison of registered versus casual riders, and the type of bicycle: mountain bike, fat bike, or road bike.
#pairs(day_data, col="dodgerblue")
pairs(day_data[,10:16],
main = "Pairs Plot for Numeric Features",
col="dodgerblue"
)
By looking at the pairs plot we can deduce that:
- there is a relationship between registered users and the total ridership count (cnt).
- there is also a relationship between temp and cnt.
cor(day_data[,10:16])
## temp atemp hum windspeed casual registered cnt
## temp 1.0000 0.9917 0.12696 -0.1579 0.54328 0.54001 0.6275
## atemp 0.9917 1.0000 0.13999 -0.1836 0.54386 0.54419 0.6311
## hum 0.1270 0.1400 1.00000 -0.2485 -0.07701 -0.09109 -0.1007
## windspeed -0.1579 -0.1836 -0.24849 1.0000 -0.16761 -0.21745 -0.2345
## casual 0.5433 0.5439 -0.07701 -0.1676 1.00000 0.39528 0.6728
## registered 0.5400 0.5442 -0.09109 -0.2174 0.39528 1.00000 0.9455
## cnt 0.6275 0.6311 -0.10066 -0.2345 0.67280 0.94552 1.0000
hist(day_data$cnt,
col = "dodgerblue",
main = "Histogram of Total Ridership Count",
xlab = "Total Ridership Count")
It looks like the bike ridership count variable follows an approximately normal distribution.
par(mfrow=c(2,2))
hist(day_data[,10], main="Temperature", xlab = "Normalized Temperature", col = "darkorange")
hist(day_data[,11], main="ATemp", xlab = "Normalized Temperature", col = "purple")
hist(day_data[,12], main="Humidity", xlab = "Humidity", col = "blue")
hist(day_data[,13], main="Windspeed", xlab = "Windspeed", col = "skyblue")
These variables do not look exactly normally distributed; however, windspeed and humidity look close to normal.
par(mfrow = c(2,3))
barplot(prop.table(table(day_data$season)), col = 1:4, main = "Distribution of Seasons", xlab = "Season", ylab = "Proportion")
barplot(prop.table(table(day_data$mnth)), col = 1:12, main = "Distribution of Months", xlab = "Month", ylab = "Proportion")
barplot(prop.table(table(day_data$weekday)), col = 1:7, main = "Distribution of Weekdays", xlab = "Weekday", ylab = "Proportion")
barplot(prop.table(table(day_data$holiday)), col = 6:7, main = "Distribution of Holidays", xlab = "Holiday", ylab = "Proportion")
barplot(prop.table(table(day_data$workingday)), col = 8:9, main = "Distribution of Working Days", xlab = "Working Day", ylab = "Proportion")
barplot(prop.table(table(day_data$weathersit)), col = 12:14, main = "Distribution of Weather Types", xlab = "Weather Type", ylab = "Proportion")
We can conclude the following points from the above plots:
par(mfrow=c(2,2))
plot(cnt ~ temp,
data = day_data,
main = "Temperature vs Total Bike Ridership",
xlab = "Temperature",
ylab = "Total Bike Ridership", col = 2)
plot(cnt ~ hum,
data = day_data,
main = "Humidity vs Total Bike Ridership",
ylab = "Total Bike Ridership",
xlab = "Humidity", col = 3)
plot(cnt ~ windspeed,
data = day_data,
main = "Windspeed vs Total Bike Ridership",
ylab = "Total Bike Ridership",
xlab = "Windspeed", col = 4)
Looking at the above plots, there appears to be a linear relationship between temperature and total bike ridership, although it could well be polynomial. Nothing concrete can be said about the relationships of humidity or windspeed with total bike ridership.
par(mfrow=c(2,2))
plot(windspeed ~ hum,
data = day_data,
main = "Windspeed vs Humidity",
ylab = "Windspeed",
xlab = "Humidity", col = 5)
plot(windspeed ~ temp,
data = day_data,
main = "Windspeed vs Temperature",
ylab = "Windspeed",
xlab = "Temperature", col = 6)
plot(hum ~ temp,
data = day_data,
main = "Humidity vs Temperature",
ylab = "Humidity",
xlab = "Temperature", col = 7)
Looking at the above plots, there appears to be some relationship between windspeed, humidity, and temperature; it could be polynomial in nature, and we need to explore further to know the true nature of these relationships.
par(mfrow = c(1,2))
boxplot(cnt ~ weathersit,
data = day_data,
col = 2:4,
main = "Count by Weather",
xlab = "Weather",
ylab = "Total Ridership")
boxplot(cnt ~ season,
data = day_data,
col = 6:12,
main = "Count by Season",
xlab = "Season",
ylab = "Total Ridership")
boxplot(cnt ~ mnth,
data = day_data,
col = 2:14,
main = "Count by Month",
xlab = "Months",
ylab = "Total Ridership")
boxplot(cnt ~ weekday,
data = day_data,
col = 2:8,
main = "Count by Weekday",
xlab = "Weekday",
ylab = "Total Ridership")
boxplot(cnt ~ workingday,
data = day_data,
col = 10:12,
main = "Count by Working Day",
ylab = "Total Ridership")
boxplot(cnt ~ holiday,
data = day_data,
col = 12:15,
main = "Count by Holiday",
ylab = "Total Ridership")
From the above plots we can deduce that:
- Total ridership increases in clear weather.
- Total ridership increases in the Summer and Fall seasons.
- Total ridership increases in June, July, August and September.
- Weekday and working day do not offer much of interest in the boxplots.
- Mean ridership on holidays is lower than on non-holidays.
We will now consider the interactions between the categorical features we identified above and some numerical features.
par(mfrow = c(1,2))
plot(day_data$temp, day_data$cnt,
col = day_data$season,
pch = 20,
main = "Ridership vs Temperature by Seasons",
xlab = "Temperature",
ylab = "Ridership")
legend("topleft", legend = c("Spring", "Summer", "Fall", "Winter"),
col = 1:4, pch = 20)
plot(day_data$hum,
day_data$cnt,
col = day_data$season,
pch = 20,
main = "Ridership vs Humidity by Season",
xlab = "Humidity",
ylab = "Ridership")
legend("topleft", legend = c("Spring", "Summer", "Fall", "Winter"),
col = 1:4, pch = 20)
As we would expect temperature to be correlated with season, the Ridership vs Temperature plot does not show any interesting patterns. However, the Ridership vs Humidity plot has vertical clusters, which could indicate interesting correlations.
par(mfrow=c(1,2))
plot(day_data$temp,
day_data$cnt,
col = day_data$weathersit,
pch = 20,
main = "Ridership vs Temperature by Weather",
xlab = "Temperature",
ylab = "Ridership")
legend("topleft", legend = c("Clear", "Misty", "Light Precip"), col = 1:3, pch = 20)
plot(day_data$hum,
day_data$cnt,
col = day_data$weathersit,
pch = 20,
main = "Ridership vs Humidity by Weather",
xlab = "Humidity",
ylab = "Ridership")
legend("topleft", legend = c("Clear", "Misty", "Light Precip"), col = 1:3, pch = 20)
Similarly, as we would expect humidity to be correlated with weather, the Ridership vs Humidity plot does not show any interesting patterns, while the Ridership vs Temperature plot indicates that there may be some interaction between weather and temperature.
par(mfrow = c(1,2))
plot(day_data$temp,
day_data$cnt,
col = day_data$mnth,
pch = 20,
main = "Ridership vs Temperature by Month",
xlab = "Temperature",
ylab = "Ridership")
plot(day_data$hum,
day_data$cnt,
col = day_data$mnth,
pch = 20,
main = "Ridership vs Humidity by Month",
xlab = "Humidity",
ylab = "Ridership")
legend("topleft", legend = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"), col = 1:12, pch = 20)
It is difficult to draw conclusions from the plots above; however, both interactions may merit further investigation.
This report was prepared by Team Outlier – Michael Alpas, Mohit Singh and Upendra Yadav.